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Abstract 

When modeling geostatistical or areal data, spatial structure is commonly accom¬ 
modated via a covariance function for the former and a neighborhood structure for the 
latter. In both cases the resulting spatial structure is a consequence of implicit spatial 
grouping in that observations near in space are assumed to behave similarly. It would 
be desirable to develop spatial methods that explicitly model the partitioning of spa¬ 
tial locations providing more control over resulting spatial structures and being able 
to better balance global vs local spatial dependence. To this end, we extend product 
partition models to a spatial setting so that the partitioning of locations into spatially 
dependent clusters is explicitly modeled. We explore the spatial structures that result 
from employing a spatial product partition model and demonstrate its flexibility in 
accommodating many types of spatial dependencies. We illustrate the method’s utility 
through simulation studies and an education application. Computational techniques 
with additional simulations and examples are provided in a Supplementary Material 
hie available online. 
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1 Introduction 


Research dedicated to developing statistical methodologies that in some way incorporate 
information relating to location has grown exponentially in the last decade. In fact, spatial 
methods are now available in essentially all areas of statistics and have been developed 
to accommodate both areal (lattice) and geo-referenced data. The principal motivation in 
developing these methods is to produce inference and predictions that take into account the 
spatial dependence that is believed to exist among observations. The end result is typically 
a smoothed map for areal data or a predictive map for geo-referenced data. These maps 
are frequently produced by implicitly performing a type of spatial grouping that carries out 
the intuitively appealing notion that responses measured at locations near in space have 
similar values. Since the grouping is implicit, the spatial partition is not directly modeled 
but is a consequence of model choices (e.g., neighborhood structure or covariance function). 


For areal data this can lead to spatial correlation structures that are counter-intuitive (Wall 


2004). Additionally, it is common that the smoothed or predictive maps are global in nature 


in that methods are not flexible enough to capture local deviations from an overall spatial 
structure. 



Figure 1: Synthetic spatial helds. From left to right, the graphs display random helds that 
become progressively more local. 


Figure [T] provides a synthetic example of local vs. global spatial dependence. The three 
plots were generated using a Gaussian process featuring an exponential covariance function. 
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From left to right the random fields become increasingly more local. The left plot displays 
one spatial process over the entire domain that has expectation 0, nngget 0.1, partial sill 


2, and effective range 6 (see Banerjee et al. 2014, Chapter 2 for more details). The second 
plot is generated with the same covariance fnnction, bnt the field is partitioned into fonr 
rectangnlar clnsters and each is assigned a specific constant mean (1,—0.5, 0.25,—1), thns 
indncing a small amonnt of local strnctnre. The right plot is the most local of the three 
as each clnster is a realization from a nniqne spatial process that has expectation 0 and 
a clnster specific partial sill (1,2, 3,4) and effective range (0.5,10,5,20). Methods able to 
flexibly captnre these three strnctnres wonld certainly be appealing. Developing these types 
of methods is the primary focns of this paper. 

Onr approach is to develop a class of priors based on prodnct partition models (PPM, 
Hartigan [1990 ) that directly model the partitioning of locations into spatially dependent 


clnsters. Making the PPM location dependent is necessary in a spatial setting becanse if 
not, then locations that are very far apart conld possibly be assigned to the same clnster 
with high probability. As a conseqnence, the marginal correlation between observations far 
apart conld be stronger than that of observations near each other, which rnns connter to 
correlation strnctnres often desired in spatial modeling. As will be seen, PPM’s are a very 
attractive way to partition spatial nnits as they are extremely flexible in accommodating 
different types of spatial clnsters. 

The method we develop is able to adapt to the three scenarios described in Fignre[T]by 
incorporating spatial information in two ways. The first is via a prior on the partitioning of 
locations nsing PPM ideas. The second is throngh the likelihood either directly or hierar¬ 
chically. If spatial strnctnre is not bnilt in the likelihood, the spatial PPM will marginally 
indnce local spatial dependence among observations. As an aside, apart from more accnrately 
modeling spatial phenomena, considering local spatial dependance potentially provides large 
compntational gains as covariance matrices are considerably smaller. 

Spatial methods now have a large presence in the statistical literatnre. We focns on 
methods that incorporate spatial dependence flexibly. For a general overview of spatial 
methods see Gelfand et al. (2010), Banerjee et al. (2014), or Schabenberger and Gotway 
( |2005| . 

Locating spatial clnsters is commonly considered in spatial point processes ( Diggle||2014 ). 
That said, from a modeling standpoint, the analysis goals are completely different from those 
we consider. Image segmentation is an extensively stndied area that we do not attempt to 
fnlly snrvey here. We do mention the spatial distance dependent Ghinese restanrant process 
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of Ghosh et ah (2011) (a spatial extension of the distance dependent Chinese restanrant 
process of Blei and Frazierj2011) as they develop a process that prodnces a non-exchangeable 
distribntion on location dependent partitions throngh a distance dependent decay fnnction. 
Thongh there are similarities, onr approach is model based and therefore provides measnres 
of nncertainty regarding inferences and predictions. 


Gelfand et ah (2005) developed a spatial Dirichlet process (DP) by modeling atoms asso¬ 


ciated with Sethnraman (1994)’s stick-breaking random measnre constrnction with a random 
field. Dnan et ah (2007) generalized the spatial DP throngh a type of mnltivariate stick¬ 


breaking in which individnal sites conld possibly arise from nniqne snrfaces introdncing a type 


of local spatial modeling. Both spatial DP processes reqnire replication. Griffin and Steel 


(2006) developed the ordered dependent DP where stick breaking weights are randomly per- 


mnted according to a latent spatial point process thns indncing spatial dependence. Petrone 


et ah (2009) developed a DP that pieces together fnnctions and applied it to a spatial field. 


Reich and Bondell (2011) nse a DP to model locations directly resnlting in spatially refer¬ 


enced clnsters. All of these methods indnce a marginal distribntion on partitions throngh 
the introdnction of latent cluster labels. 

Somewhat related to the spatial DP and operationally similar to what we introduce 


are the spatial stick-breaking process of Reich and Puentes (2007) and the logistic stick¬ 


breaking process of Ren et ah (2011) (both of which are in some sense special cases of kernel- 
stick breaking process of Dunson and Park 2008). Both stick-breaking processes induce 
spatial dependence via kernel functions that allow stick-breaking weights to change with 
space. A related probit-stick breaking prior for spatial dependence was recently proposed in 


Papageorgiou et ah (2014). 


Other authors have employed DP type methods to areal data resulting in a more flexible 


(local) neighborhood structure (Li et ah 2014, Lee et ah 2014). Kang et ah (2014) created 
local conditional autoregressive (GAR) models to accommodate local spatial residual. 

Even though all the previously mentioned nonparametric Bayes based methods may have 
some inferential similarities or are at least operationally similar to what we are proposing, 
they are fundamentally different. We do not introduce any notion of a random probability 
measure. Therefore, we are not bound to an induced marginal model on partitions available 
from the DP (though this particular model is certainly available as a special case). Instead 
we directly model the spatially dependent partition using a PPM. Doing so provides much 
more control over the partitioning of spatial units into clusters. 


From a disease mapping perspective, Denison and liolnies (2001) consider spatial clus- 
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tering by first selecting cluster centroids and using tessellation ideas of [Lawson and Denison 


(2002) to determine cluster memberships. This requires employing Reversible Jump MCMC 


and produces spatial clusters that are necessarily convex. Knorr-Held and Raher (2000) 


cluster areal units via a distance measure that is based on shared boundaries. Hegarty and 


Barry (2008) employ a PPM to model partitions of areal units, though they do not explore 


the spatial properties of their model and are restricted to a very specific setting. We aim to 
propose a very general methodology that is flexible in accommodating many types of spatial 
dependencies. In fact, we will show that once a model for the partition has been specified, 
the sky is limit in terms of how spatial dependence can be incorporated in other parts of the 
model. 

The remainder of the article is organized as follows. In Section 2 we provide some 
preliminaries on PPM’s and a bit of discussion on spatial clustering. Section 3 details 
spatial extensions of the PPM and investigates spatial properties. Section 4 contains a 
small simulation study and a Chilean education data application. We make some concluding 
remarks in Section 5. Lastly, the Supplementary Material file available online contains 
computational details along with additional simulations and applications. 

2 Preliminaries 

We provide background to PPM’s and a bit of discussion motivating our view of spatial 
clusters. 

2.1 Preliminaries of Product Partition Model 


PPM’s were first introduced by Hartigan (1990) and have since been extended to include 


covariates ( Muller et al.||2011 and Park and Dunson 2010) and correlated parameters (Mon- 
teiro et ^[2011 ). They’ve been employed in applications ranging from change point analysis 


(Barry and Hartigan 1992) to functional clustering (Page and Quintana 2014) among oth¬ 


ers. Since PPMs are central to our approach of carrying out spatial clustering, we briefly 
introduce them here. Consider n distinct locations denoted by Si,..., s„. The s* are quite 
general in that they can be latitude and longitude values or in the case of areal data they 
could define a neighborhood structure. The goal is to directly model the partitioning of the 
Sj, z = 1,..., n into groups. With this in mind, let = {51,..., denote a partition¬ 
ing (or clustering) of the n locations into subsets such that i & Sh implies that location i 
belongs to cluster h. Alternatively, we will denote cluster membership using ci,..., c„ where 
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Ci = h implies i G Sh- Then the PPM prior for p is simply 


Pr(p)o>:nC(S»), 


( 2 , 1 ) 


h=l 


where C{Sh) > 0 for Sh C {1,..., n} is a cohesion function that measures how likely elements 


of Sh are clustered a priori. The normalizing constant of (2.1) is simply the sum of (2.1) 


over all possible partitions. A popular cohesion function that connects ( 2.1[ ) to the marginal 
prior distribution on partitions induced by a Dirichlet process (DP) is C{S) = M x P(IS'I). 
This cohesion produces a PPM that encourages partitions with a small number of large 
clusters and also a few smaller clusters (the rich get richer property). This property will be 
useful to avoid creating many singleton clusters when extending PPM’s to a spatial setting 
and therefore the form M x rdS*!) will be used regularly. Eventually we will consider a 
response and covariate vector measured at each location which will be denoted by y{si) and 
x{si) respectively. Finally, it will be necessary to make reference to partitioned location and 
response vectors which we denote by = {sj : i G Sh} and = {y{si) : i G Sh}- 


2.2 Spatial Clustering 

Before proceeding, we expound on the term “spatial cluster” and make its dehnition used in 
this paper concrete (for more discussion on the subject of spatial clusters see 
Chapter 6). Typically, clustering attempts to group or partition individuals or experimental 
units based on some measured response variable. Therefore, the resulting partition consists 
of clusters whose members are fairly homogenous with respect to the measured response. 
How cluster boundaries are dehned (e.g., elliptical, convex) is crucial to the resulting parti¬ 
tion and to our knowledge no universally agreed upon dehnition exists. When in addition 
to a measured response, the proximity of individuals or experimental units inhuences the 
partitioning of individuals, then we refer to these clusters as “spatial”. 

If spatial structure exists among the realizations of some response variable measured at 
various locations, then the values measured at locations near each other should be more 
similar than those that are far apart. However, this doesn’t exclude the possibility of two 
individuals far apart producing similar responses. Clustering in the absence of spatial infor¬ 
mation would group these two individuals together (as would be the case in a non-spatial 
PPM). From a spatial perspective it seems more natural that locations far from each other 
would not belong to the same cluster. That is, spatial clusters should be in some sense “lo- 


Lawson 2013 
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cal” in that locations that belong to the same cluster should share a boundary for areal data 
(or comply with some other neighborhood structure) or attain a pre-determined minimum 
distance with other members of the cluster for geo-referenced data. We make this concrete 
with the following dehnition. 

Definition 2.1. Consider corresponding to cluster Sh C n} and let be a 

metric in the space of spatial coordinates. We say that cluster Sh is spatially connected if 
there does not exist Sj/ ^ such that for all Sj, Sj G where Sj ^ Si, d{si/, Si) < d{sj, Si). 
A partition will be called spatially connected if all of its clusters are spatially connected. 

Figure provides four spatial plots of regular grids that assist in visualizing spatially 
connected clusters. The top left plot is an example of convex clusters that are connected 
while the top right plot contains connected clusters one of which is concave. The bottom left 
plot is an example of a partition that is not connected as the cluster of triangle points has 
been split by the cluster of square points. The bottom right plot is an example of clusters 
that are connected even though there exists a singleton island cluster. 

Our vision of spatial clusters does not necessarily partition the spatial domain into disjoint 
sets. Because clusters possibly depend on variables other than location, it is possible that 
two clusters exist in the same geographical region. The presence of these “stacked” clusters 
seems common and a perk of the methodology we develop. 

3 Methodological Development 

We now detail spatial extensions to the basic PPM (here after referred to as sPPM) and 
investigate cluster membership probabilities. Also, we show that combining sPPM with 
likelihoods (that potentially include spatial information) produce marginal spatial structures 
with appealing properties (e.g., non-stationary) and balance local vs. global structure. As 
both cluster membership probabilities and correlations depend on the cohesion function we 
propose a few reasonable candidates. 
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Figure 2: Regular grids that provide an illustration of spatial connectedness. The top two 
hgures display partitions that are spatially connected with the left demonstrating concave 
clusters and the right convex. The bottom left graph illustrates a partition that is not 
spatially connected as the green cluster is not spatially connected since it has been completely 
separated by the red. The partition in the bottom right hgure is spatially connected even 
though there exists an island (singleton) cluster. 













3.1 Cohesion Functions 


Extending the PPM to incorporate spatial information requires making the cohesion of (2.1) 
a function of location. With this in mind, consider 


Pr{p) oc ]^C(S'h, s 


1 ). 


(3.1) 


h=l 


which makes the clustering process location dependent. (This is structurally similar to Park 


and Dunson 2010 s approach to extending the PPM to incorporate covariates.) Defining a 
cohesion function that only admits spatially connected partitions is conceptually straight¬ 
forward. For example, one could employ 


C(^, si) = 


M X rdS*!) if S' is spatially connected 
0 otherwise. 


where M x Eds'!) is used to favor a small number of large clusters with the number of clusters 
being regulated by M. A cohesion function defined in this way places zero prior mass on 
partitions that are not spatially connected. Although this definition is intuitively appealing, 
it is particularly challenging to implement from a computational stand point and can only 
realistically be considered for a small number of locations. Therefore, we suggest considering 
cohesion functions that assign small probabilities to partitions with clusters that are not 
spatially connected. A nice feature of the sPPM is that there are many ways in which this 
can be carried out and we introduce four reasonable candidates. Subsequently, we study the 
spatial properties of each one. 

As we introduce the first cohesion function keep in mind that our overarching goal is 
to develop a prior that favors spatially connected partitions without creating a bunch of 
singleton clusters. One way to carry this out is by employing tessellation ideas found in 


Denison and Holmes (2001) in that distances to a cluster centroid are considered. To this 
end, let denote the centroid of cluster Sh and = Yhi&Sh ^h) fhe sum of all distances 
from the centroid (unless otherwise stated we use Euclidean norm || ■ ||). Defining the cohesion 
as a decreasing function of Vh would certainly produce small local clusters. Unfortunately, 
cohesions that favor clusters with small Vh would also produce partitions with many singleton 
clusters. To counteract this, we make the cohesion a function of M x r(| S'/ll) in addition 
to Vh- Now since r(| S'/ll) would overwhelm Vh as cluster membership grows, we consider 
T['Dh)l\Ph > 1] + PhVlPh < !]• (The partitioning of D/i’s domain was motivated by the fact 
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that the gamma function is not monotone on [0,1] and does not tend to zero as Vh tends to 
zero). Finally, to provide a bit more control over the penalization of distances, we introduce 
a user supplied tuning parameter, a, resulting in the following cohesion function 

f Mxr{\s,\) 

C,{S,, 4) = I r{aV,)I[V, > 1] + {V,)I[V, < 1] ' ( 3 , 2 ) 

[ M if = 1. 

We set Ci{Sh, sD = M for |S'/i| = 1 to avoid issues associated with Vh = 0. Notice that since 
all si, ..., are distinct Vh = 0 \Sh\ = 1- Further, when \Sh\ = 1, M x FdS'/il) = M 

justifying in a sense setting the cohesion to M when \Sh\ = 1- 

The second cohesion function we consider provides a hard cluster boundary and for some 
pre-specihed a > 0 has the following form 


C 2 {Sh, sl) = Mx F(|4|) X J] I[||s, - s,-|| < a], 

i,j&Sh 


(3.3) 


Once again, M x F(|S'/i|) is included to inherit the “rich get richer” property of DP partition¬ 
ing. This cohesion is amenable to neighborhood structures of areal data modeling. Instead 
of Sj) < a], one could use I[i ~ j] where i ^ j indicates that Sj and Sj are neighbors 

according to some neighborhood structure. If a data dependent neighborhood structure is 
desired, one could introduce auxiliary variables in the cohesion and employ ideas similar to 


those found in Kang et ah (2014). 


sPPM under Ci and C 2 produces a completely valid joint distribution over partitions 
that is quite general. In fact, since the cohesions are functions of not only \Sh\ but also of 
sPPM relaxes exchangeability assumptions. However, for this same reason sPPM under 
Cl and C 2 does not inherit the PPM ( 2.1| )’s property of being coherent across sample sizes. 
That is, P{pn) 7 ^ Yl^h=i P{Pn, Cn+i = h). This is easily seen as the location of s„+i influences 
P{PniCn+i = j)- Although this does not change the fact that the sPPM produces a valid 
joint distribution over partitions, for computational purposes it is sometimes desirable to 
have coherence across sample sizes. To retain this property one would need to “marginalize” 


over all possible locations. This was considered in detail in Muller et ah (2011) (and also 
mentioned in Park and Dunson||2010) when making a PPM covariate dependent. We employ 


ideas developed in Muller et ah (2011) in a spatial setting which produces the following 
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cohesion 


G3(S(..«i) = Mxr(|Sk|)x 



(3.4) 


In Bayesian modeling f HieSh often called the marginal likelihood or prior 

predictive distribution and is used to measure the similarity among the locations belonging to 
cluster h. Therefore, favors partitioned location vectors (s*) that produce large marginal 
likelihood values. To simplify evaluating and retain coherence across sample sizes, q{s\^) 
and g(^) are specihed to form a conjugate probability model. We emphasize however that 
we are not assuming the Sj’s to be random, we are simply employing the conjugate model as 
a means to measure spatial proximity and encourage co-clustering of locations that are near 
each other. Both areal and point referenced data can be considered when C 3 is employed, all 
that is required is specifying appropriate q{s\^) and q{$,)- For example, if point referenced 
data are available, a conjugate Gaussian/Gaussian-Inverse-Wishart model would be appro¬ 
priate. In this case ^ = (m, V) would denote a mean and covariance, q{s\^) = N{s\m, V) 
a bivariate Gaussian density and g(^) = NIW{m,V\^io, ko,^o,-^o) a bivariate Normal- 
Inverse-Wishart density. For areal data a conjugate multinomial/Dirichlet model could be 
utilized. In what follows we focus on point reference case and will occasionally refer to C 3 
as the auxiliary cohesion. Finally, as in the previous two cohesions, M x F(|S'ft|) is included 
to avoid creating many singleton clusters. 


The fourth and hnal cohesion that we consider is similar to what Quintana et ah (In 


press) call a “double dipper” cohesion. It has the same form as C 3 , but instead of employing 


a prior predictive conjugate model, a posterior predictive conjugate model is used. Therefore 
C 4 has the following form 


C4(St,4) = M X r(|s„|) X / J] q(si\iMUsl)di,. 


(3.5) 


ieSh 


Since the posterior predictive is typically more peaked than the prior predictive, C 4 puts 
more weight on partitions that are local. Once again both areal and point referenced data 
are possible, but in what follows we focus on point-referenced and use the following conjugate 
model; N 2 {si\mh, Vh)NIW{mh, Vh\sl). 

Before proceeding we provide more detail regarding the role of the scale parameter (M) in 
sPPM. In Dirichlet process (DP) modeling M regulates the number of clusters and it is fairly 
well known that the expected number of clusters a priori under the DP induced probability 
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distribution on partitions is approximately M log[(M + n)/M]. Thus the number of clusters 
grows slowly as n increases which favors partitions with a small number of large clusters 
(rich get richer). This motivated its inclusion in the four cohesions (without it each cohesion 
would favor partitions with a large number of singletons). However, when M x r(|S'/j|) is 
coupled with distance penalties, it is not clear how the number of expected clusters a priori 
grows as a function of M. We explore this using a small simulation study in the next section. 

3.2 Cluster assignment probabilities 

To investigate how distance influences partition (cluster membership) probabilities we con¬ 
sider the very simple case of n = 2. In this context only two possible partitions exist: ({1, 2}) 
and ({!}, {2}). Table [^provides Pr{p = {1, 2}) for each of the cohesion functions along with 
the limiting probabilities as d{si,S 2 ) —)■ 0 and d{si,S 2 ) —)■ oo. To simplify calculations, for 
the auxiliary and double dipping similarity functions we use no = Sh, kq = 1, z/q = 2, and 
Aq a diagonal matrix of dimension 2 and we will use S = ~ ^h){si — Sh)'- 

Table 1: Prior Partition Probabilities 

d{si, S2) —>■ 0 d{si, S2) —)■ 00 

Cohesion Pr({l,2}) Pr({l,2}) Pr({l,2}) 


Ci{S,,sl) 

1 

1 

1 + M{T{aV,)I[Dh > 1] + VJ[V, < 1]} 

C2{Sh,sI) 

I[d{si,S2) < a] 

1 

/[d(si, S2) ^ n] + M 

1 + M 

C3{Sh,sI) 

1 

1 

l + 2M|Ao + 5|3/2 

1 + 2M 

C,{S,,sl) 

81|Ao + 5|2 

81 

81|Ao + 5|2 + 10M|Ao + 25|3 

81 + lOM 


From Table it can be seen that for all four cohesions the probability that both locations 
are members of the same cluster approaches zero as distance between the two locations 
increases (a quality that is desirable). However, only Ci displays the property that as distance 
between two locations decreases the probability of clustering the two locations approaches 
1. This limiting probability for the other three cohesion functions depends on M and other 
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tuning parameter choices. Of the three, for a fixed M, Pr{{l, 2}) increases as d{si, S 2 ) —?• 0 
quickest for O 4 and slowest for C 2 - To see this let M = 1 (common in DP modeling), then 
as d{si,S 2 ) —)■ 0, Pr({l,2}) approaches 0.5 for O 2 , 0.72 for O 3 , and 0.89 for O 4 . A slightly 
more sophisticated example that further explores partition probabilities is provided in the 
Supplementary Material. 

Figure displays pairwise probabilities of locations belonging to the same cluster for a 
10 X 10 regular grid. Since sPPM under cohesions 1 and 2 are not coherent across sample 
sizes, care must be taken when generating samples from the prior and we use self-normalized 
importance sampling ( Robert and Casella||2009 chap 3) to appropriately reweight partitions 
drawn from the predictive distribution based on Ci and 6 * 2 . M is set to 0.1 for Ci and C 2 
and M = 1 for C 3 and C 4 . For C 2 we set a = 1.77 which is the median distance among all 
pairwise distances, and the tuning parameters associated with Ci, C 3 and C 4 are those used 
previously. From Figure it appears that Ci and C 4 are similar in how distance penalizes 
cluster membership. C 3 allows locations fairly far apart to have positive probability of being 
members of the same cluster. The cut-off boundary for cluster membership associated with 
C 2 is clearly shown. 

To better understand M’s influence on p’s cluster configuration a priori, we ran a small 
simulation study by drawing 5000 partitions from the sPPM for each of the four cohesions. 
The spatial configurations are regnlar 10 x 10, 15 x 15 and 20 x 20 grids resulting in 100, 
225, and 400 spatial locations. (We also considered the spatial configuration found in the 
application of Section 4.2 but results were similar and so are not provided.) The tuning 
parameters are set to the same values as used previously except that both a = 1 , 2 are 
considered for Ci. The results are provided in Table Under the header E{kn) are listed 
the number of clusters in p averaged over the 5,000 prior draws, #sing denotes the number 
of singletons clusters and maxIS'jl denotes the number of members in the largest cluster. 
Notice that setting a = 1.77 for C 2 forces the sPPM to have at least 10 clusters. Also, 
as expected setting a = 2 results in Ci producing more clusters. The number of clusters 
associated with Ci, ( 72 , and C 4 grow at a faster rate than Mlog((M-|-n)/M) while C 3 grows 
at a slower rate. The number of singleton clusters is also very reasonable for M < 1. 


3.3 Modeling Spatial Structure via the Likelihood and Prior 

Given p, the sky’s the limit on how spatial dependence might be modeled via the likelihood. 
A completely valid modeling strategy would be to assume independent observations given p. 
In this case, all spatial dependence would originate from the spatial clustering produced by 
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Figure 3: Pairwise probability matrix of two locations belong to the same cluster for a 10 x 10 
regular grid. M = 0.1 for each cohesion 
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Table 2; Results from simulation study which drew 5,000 partitions from sPPM for each of 
the four cohesions. 


M 

Method 


n = 100 


n = 225 


n = 400 

E{kn) 

#sing 

max Sj 

E{kn) 

#sing 

max Sj 

E{kn) 

#smg 

max Sj 



1.00 

0.00 

100.00 

1.00 

0.00 

224.99 

1.01 

0.00 

399.99 



3.91 

0.03 

37.06 

4.61 

0.01 

66.85 

4.98 

0.00 

106.92 

10“^ 

C2 

10.08 

0.82 

18.18 

11.63 

0.68 

39.11 

13.06 

0.64 

67.59 


Cs 

1.00 

0.00 

100.00 

1.00 

0.00 

225.00 

1.00 

0.00 

400.00 


C4 

1.00 

0.00 

99.98 

1.00 

0.00 

224.99 

1.00 

0.00 

399.96 



1.01 

0.01 

99.96 

1.03 

0.02 

224.93 

3.00 

0.00 

345.00 



4.58 

0.04 

31.04 

5.40 

0.00 

57.28 

7.00 

0.00 

80.02 

10“^ 

C2 

10.11 

0.81 

18.20 

11.65 

0.68 

39.13 

13.08 

0.64 

67.53 


Cs 

1.00 

0.00 

99.99 

1.00 

0.00 

224.98 

1.00 

0.00 

399.92 


C4 

1.00 

0.00 

99.97 

1.00 

0.00 

224.90 

1.00 

0.00 

399.86 



1.16 

0.03 

99.37 

2.17 

0.00 

141.19 

2.77 

0.00 

227.96 



5.50 

0.00 

25.76 

6.76 

0.00 

49.19 

8.08 

0.00 

68.10 

10“3 

C2 

10.10 

0.82 

18.15 

11.65 

0.68 

39.15 

13.05 

0.64 

67.49 


Cs 

1.00 

0.00 

99.93 

1.00 

0.00 

224.85 

1.01 

0.00 

399.52 


C4 

1.02 

0.00 

99.62 

1.02 

0.00 

224.00 

1.02 

0.00 

398.27 



3.00 

0.01 

55.99 

3.18 

0.00 

95.76 

3.00 

0.00 

151.00 



8.43 

0.03 

20.62 

9.51 

0.02 

39.33 

12.93 

0.00 

53.83 

10-2 

C2 

10.17 

0.84 

18.13 

11.72 

0.70 

39.10 

13.20 

0.65 

67.30 


Cs 

1.04 

0.01 

99.22 

1.05 

0.01 

223.42 

1.05 

0.01 

396.73 


C4 

1.16 

0.01 

96.33 

1.17 

0.01 

217.12 

1.19 

0.01 

385.04 



5.91 

0.22 

30.66 

8.87 

0.00 

46.20 

8.50 

0.02 

83.57 



14.12 

0.73 

13.78 

18.98 

0.63 

22.30 

25.03 

0.31 

32.20 

10-1 

C2 

10.89 

1.00 

17.77 

12.69 

0.89 

38.15 

14.34 

0.85 

65.57 


Cs 

1.42 

0.07 

92.84 

1.46 

0.07 

209.11 

1.51 

0.07 

370.28 


C4 

2.22 

0.10 

76.89 

2.40 

0.09 

171.75 

2.52 

0.10 

304.25 



14.96 

1.11 

14.24 

21.66 

0.63 

22.48 

31.03 

1.21 

31.20 



26.50 

2.57 

7.85 

43.98 

3.27 

11.70 

54.80 

1.74 

16.78 

10° 

C2 

17.84 

3.19 

14.54 

22.31 

2.96 

30.38 

26.23 

3.00 

51.37 


Cs 

4.27 

0.72 

62.99 

4.64 

0.70 

141.88 

5.01 

0.71 

249.55 


C4 

7.70 

0.97 

35.91 

9.17 

0.94 

76.42 

10.22 

0.96 

132.32 



36.51 

9.28 

7.06 

60.10 

9.61 

10.12 

85.77 

10.31 

13.30 



52.34 

19.55 

4.46 

92.38 

19.91 

6.68 

137.61 

19.82 

7.87 

lOi 

C2 

46.78 

21.86 

7.21 

70.16 

23.34 

13.27 

92.31 

24.77 

20.80 


Cs 

18.83 

6.59 

25.10 

23.02 

6.80 

56.47 

25.86 

6.89 

99.96 


C4 

27.72 

8.93 

12.88 

37.99 

9.19 

25.10 

46.30 

9.33 

41.22 
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the sPPM. Alternatively, global spatial structure or cluster specific spatial structure may be 
included in the likelihood producing much richer marginal spatial structure. 

To explore spatial dependence further, we consider correlations among two observations 
as distance between them either increases to cx) or decreases to 0. This is done under a few 
likelihood models for each of the cohesions. Letting y = (?/(si),..., j/(s„)), in the absence 
of spatial dependence in the likelihood, the basic model employed is 

kji 

fiy\p) = Y[fh{yh) (3-6) 

h=l 

kn 

h=l 

With fhivl) = /riieS;, f {y{si)\6)dGQ{6) and f{-\6) denoting the likelihood and Gq a prior 
on 6 . Alternatively, the model can be written hierarchically using cluster labels ci,..., in 
the following way 


y{si) I e, d fiO*), for i = 1 ,..., n 


iid 


0 ^ ~ Go, for £ = 1,..., /c„ 


(3.7) 


with 91,... ,91^ denoting cluster specific parameters so that 9i = 9*.. In the spatial setting 
Cl,... ,Cn are dependent multinomial latent variables with component probabilities derived 
from the sPPM. 

When spatial structure is included in the likelihood it is done hierarchically by way of 


introducing spatial random effects, and models (3.6) and (3.7) will need to be adjusted 


accordingly. The spatial random effects can be cluster specific or global. If covariates are 
available, their relationship to the response can also be modeled as being cluster specific 
(local) or not (global). To simplify calculations in what follows we consider a Gaussian 
likelihood by setting f{-\0) = N{-\fi,a‘^). Proofs to all Propositions are provided in the 
Appendix. 


3.3.1 Covariances Under Local Regression 


Proposition 3.1 furnishes the correlation between two observations available from a model 


that incorporates spatial information in the prior only. Therefore, all spatial structure is 
completely produced by the sPPM. 
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Proposition 3.1. Let x{si) = Xi and y{si) = Ui denote a p-dimensional covariate vector 
and response at location Si. Further, let ..., denote cluster specific parameters such 
that /3^ ~ N{pL,T) and assume that p and {f3h}h=i mutually independent. Then under 
likelihood 




(3.8) 


and a sPPM prior for p, the marginal correlation between two observations is 


corr{yi, y^) = 




^x'-Txi + a'^^x'jTxj + a' 


.Pr{ci = Cj). 


(3.9) 


When x{si) = 1 for all i (i.e., no covariates are available) and rs^ iV(/i,r 2 ), ( |3.9D simplifies 
to 


corr{yi,yj) = 


+ a- 


;Pr{Ci =Cj). 


(3.10) 


Remark 3.1. Recall that as d{si,Sj) —)■ oo, Pr{ci = cj) —)■ 0 and therefore corr{yi,yj) —)■ 
0. However, corr{yi,yj) 74 1 as d(sj,Sj) —>■ 0. Although this result does not agree with 
many spatial covariance functions, it does agree with models that include a nugget effect. 
Additionally, from a clustering perspective it makes sense that locations allocated to same 
cluster are assigned the same parameter value, but not necessarily the same response value. 


To visualize (3.10) as a function of distance {d{si,S 2 ) = ||si — S 2 II), consider again the 


case of two locations. In Figure we present correlations that are calculated by hxing 
Si = (0,0) and moving S 2 around in space. We set = 0.1 and = 1 which produces 
1/1.1 0.9 as the maximum correlation. For each cohesion we set M = 1 and use the same 


values for the tuning parameters that were used in Section 3.2 The hard boundary of C 2 is 
evident as correlations produced by C 2 are either zero or 0.5(1/1.1) ~ 0.45. The correlations 
associated with the other three cohesions decrease more smoothly as distances between si 
and S 2 increase. It appears that correlations associated with Ci decay quicker as distance 
increases relative to C 3 and 6 * 4 . The correlations associated with C 3 seem to be the most 
global in the sense that they decay slowly as a function of distance. 

In order to consider simultaneous movement between two observations, in Figurej^si, S 2 G 
M (rather than Si,S 2 G M^). Thus what is seen in Figure]^ are correlations associated with 
d{si,S 2 ) = |si — S 2 |- Once again the maximum correlation is 1 / 1 . 1 . Just as in the previous 


17 












Figure 4: Correlations produced using (3.10) when two locations are considered. Si is set to 
(0,0) and S 2 varies. The maximum correlation available is r^/(r^ + cr^) ~ 0.91 with = 1.0 
and cr^ = 0.1 
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figure, 6 * 2 ’s hard boundary is evident and Ci displays the most extreme correlation values. 
However, perhaps more interesting is the fact that the spatial structures produced by C 3 
and Ca appear to be non stationary and anisotropic as they are not constant in distance nor 
direction. 


3.3.2 Correlations Under Local Regression and Global Spatial Structure 

Proposition |3.2| provides the correlation between two observations from a model containing 
local regression and global spatial structure. 


Proposition 3.2. Let Xi, yi, and ..., (3^^ be as described in Proposition 3.1 Further, Let 

GP(0, denote an n-dimensional vector of a spatial process 

where GP(0, A^P(0)) denotes a Gaussian process with covariance function H{(j)) : xM^ —)■ 

M parametrized by 0 and assume that p, and 6 are mutually independent. Then 

for likelihood 


yi I Xi, 9i, (3*,Ci, cr^ ~ N{x[f3f + 6i, cr^) 


and sPPM for p, the marginal correlation between two observations is 


corr{yi,yj) = 


A^(P(0))ij + x'jTxiPr{ci = Cj) 
y/x'fTxi + A^ + a'^^^x'-Txj + A^ + cr^ 


(3.11) 


(3.12) 


When x{si) 
simplifies to 


1 for all i (i.e., no covariates are available) and 


iid 


N{p,T^), 


(3.12) 


corr{yi,yj) 


A^ 


+ A^ + 


(P(0))ij + 


+ A^ + 


Pr(ci = Cj). 


(3.13) 


Correlations are now a function of covariances from the GP and from spatial clustering. 
Notice that if the variability among cluster means (r^) is large relative to cr^ and A^, then 
cluster probabilities will be extremely influential in marginal correlations. Consider once 
again the simple case of two spatial locations. In this scenario if d{si,S 2 ) —)■ 00 , then 
corr{yi,y 2 ) —)■ 0. While as d{si,S 2 ) —)• 0, then corr{yi,y 2 ) —)■ (A^ + T'^Pr{ci = C 2 ))/(A^ + 
+ (j^). Thus modeling spatial partitions with the sPPM results in decreased correlation 
for locations that have small probability of being co-clustered and an increase for those that 
have high probability relative to GP type spatial structures. 
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3.3.3 Covariances Under Global Regression and Local Spatial Structure 


Proposition 3.3 provides the correlation between two observations for a model with local 
covariance structnre and global regression. 


Proposition 3.3. Let Xi, i/i be as described in Proposition 3.1. Further let (3 

and 6h = {Oi : i E Sh} such that 6h\X\*,(j)l ~ GP{0, . With out loss of generality 

order 0 = (0i,..., 0 ^^) such that 


(3.14) 



\ 


/ 

' XfH{ct>l) • 

0 

\ 




0, 





) 


V 

0 

■■ 

/ 


If spatial random effects (3.14) are combined with likelihood (3.11) and sPPM is employed to 


model p with p, f3, and 6 being mutually independent, then the marginal correlation between 
two observations is 


corr{yi,yj) = 


x',Txi + cov*{9i, 9j) 


+ x'fTxi + var*{9i)^a‘^ + x'-Txj + var*{9j) 


(3.15) 


where cov*{9i,9j) = Ylh=i K*iH{(j)l))ijPr{ci = Cj = h) and var*{9i) = Y!h=w‘h = ^)- 

When x{si) = 1 for all i (i.e., no covariates are available) and jd ~ then (3.15) 

simplifies to 


corr{yi, y^) = 


+ cov*{9i, 9j) 


+ var* (6*j) -^cr^ + + var* {9j) 


(3.16) 


It is interesting to note that covariances are weighted averages of all clnster specihc 
covariances with weights depending on distance. This type of spatial correlation structure 
is clearly nonstationary and nonisotropic. 


4 Simulation Study and Examples 

Except for very specihc examples, the discussion to this point has been fairly generic with the 
idea of explaining different modeling approaches under a general framework. Now we pro¬ 
vide more concrete illustrations by way of a small simulation study and a Chilean education 
application (with additional simulations and applications are provided in the Supplementary 
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Material). The simulation studies and applications will require making some specific mod¬ 
eling assumptions but still within the general class of models thus far presented. To make 
methods invariant to scale of location, in the simulations and applications that follow we 
standardize Si,..., to have mean zero and unit variance. Fitting the models that will be 


described is a straightforward MCMC exercise. The algorithm we employ is based on Neal 


(2000)’s algorithm number 8 and details are provided in the Supplementary Material. 


4.1 Simulation Study 

We conduct a small simulation study to explore sPPM’s ability to recover partitions, make 
predictions and assess its goodness-of-fit performance. This is done by specifying the follow¬ 
ing model 




~ iV(/r;(s,) + x{si)P, a"), a ~ 17iV(0,10), /3 ~ iV(0,10^) (4.1) 

~ N{hq, cTq) ioT h = 1,kn and /xq ~ -^(0 ,10^), cxo ~ t/ N{0,10 ) 


{cd(Li ~ sPPM. 


Here after this procedure will be referred to as the Conditional Model with Prior Spatial 
Structure (CPS). To the CPS we compare the spatial stick breaking (SSB) process found in 


Reich and Fuentes (2007) and a common spatial regression model (SR). More precisely. 


1. The SR model refers to y{si)\x{si), f3,6{si) ~ N{x'{si)f3 + 6{si),a‘^) with x'{si) = 
(l,x(si)), f3 = (/?o,/5i) ~ iV2(0,10^/), [^(si),... ,6'(s„)] ~ GP(0, A2i7(0)), and cr^ ~ 
IG{a,h). 


2. Given cluster labels {ci}r=i 5 SSB can be expressed as j/(sj)|a;(sj), ^ (hci(®i) + 

x{si)/3, a^) where q ~ Categorical(pi(si),... ,pm(si)) with pj(s) = Wj{s)Vj nfc<j[l “ 
Wk{s)Vk] for Vj beta{l, M). The Wj{s) are location weighted kernels that intro¬ 
duce spatial dependence in the model (we always use a Gaussian kernel). Lastly, 
/i^(si) ~ iV(/io,a-o) for h = 1,..., and /xq ~ 77(0,10^), ao ~ UN{0, 10). 


For the CPS we consider the four cohesions. For Ci we set a = 1 and a = 2 and use the 


same tuning parameter values as in Section 3^ for the other three cohesions functions. 

The SSB is included because it is operationally very similar to the sPPM and was fit 


using the R function provided by Reich and Fuentes (2007). Since the function only admits 
models that don’t include likelihood spatial structure, to make comparisons valid, we do not 
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incorporate spatial structure in (4.1). The spBayes package in R ( Finley and Banerjee||2013 ) 
was used to fit the SR model. 

We considered the following four factors. 


1. number of clusters (1, 4) 

2. distribution of e* (iV(0, a^) and 0.5A^(0, cr^) + 0.5A^(1, cr^) with cr^ = 0.1) 

3. value of M 

4. shapes of clusters (square, random) 


The first factor was considered to assess clustering accuracy. Note the the sPPM and SSB will 
by definition create spatially referenced clusters, so we don’t expect high clustering accuracy 
when the number of clusters is 1. But including this level will allow us to assess the CPS 
when the true data generating mechanism is much simpler. Factors 2 and 3 are included to 
assess robustness of predictions and of goodness-of-fit against possible model perturbations. 
Factor 3 will only influence CPS and is included to investigate how calibrating sPPM is 
cohesion dependent. 

To create synthetic data we employed the following as a data generating mechanism 


y{si) = /i*.(si) + x{si)l3 + e{si) + e(si) 
0 = [0(sO,..., 0(sn)] ~ G'P(0, r2Ff(0)). 


An exponential covariance function with = 2 and 0 = 6 was used to create H{(p). 


Locations (si,..., s„) were generated in two ways. The first method set Sj ~ UN{0, 1) x 
UN{0,1) with clusters being created by partitioning the simplex into four equal area 


squares and assigning Sj accordingly. For the second method we set s 

i EL.0-25iV(m,s^), 

The MixSim R function (see Melnykov et ah 2012) was employed to generate locations from 
the mixture. For data containing four clusters, values of the cluster specific intercepts were 
H* = (0,1, —1, —2). We set 0 = 1 for all data sets and used UN{0, 10) to generate x values. 


To obtain of point estimates for p we employed the least squares procedure proposed in |Dahl 

p006|). 


For each combination of factor levels D = 100 data sets containing 100 training and 100 
testing observations were generated. For each data set, the SSB, SR and sPPM procedures 
were fit to data by collecting 1000 MCMC iterates after discarding the first 1000. Results 
for M = 0.01, M = 0.1, and M = 1.0 are presented in tabular form and can be found in 
Tables and (results for other values of M are provided in the Supplementary Materials 
file). The columns of both tables correspond to the following 
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• RAND; represents the adjusted Rand index which measures proximity of estimated 
partition to the true partition. An adjusted Rand index close to 1 indicates a good 
match between estimated and true partition. The values found in the Table are the 
adjusted Rand index averaged over the D = 100 data sets. 

• MSPE: represents the mean squared prediction error defined as “ 

Yplsdi))"^ where* indexes the 100 testing observations {Yp{s)) andhj,(sjrf) = E{Yp{sdi)\Y{s)). 
This quantity measures the predictive performance of the models. The values found in 
Tables and 1^ are the MSPE averaged over the 100 data sets. 


LPML: represents the log pseudo marginal likelihood which is a goodness-of-fit metric 
(see 


Christensen et ah 2011) that takes into account model complexity. The values in 


the two tables are average LPML over the 100 data sets. 


Table provides results for data that contain four clusters. First notice that for Ci the 
model fit associated with CPS declines as M decreases, but prediction accuracy and Rand 
index values improve. This indicates that M must be small for Ci or CPS tends to overfit 
by creating many clusters. For C 3 it appears that the opposite is true. Setting a = 2 for Ci 
seems to reduce overfitting as model fit is slightly worse but out of sample prediction greatly 
improves. It seems like C 4 is the best at making accurate predictions regardless of the value 
of M, but selecting an appropriate M is clearly cohesion dependent (something we explore 
more in the Supplementary Material). Interestingly CPS (and SSB) predict slightly better 
when error is a mixture and clusters are not regular. All that said, perhaps the main take 
home message is that CPS produces more accurate predictions and better data fit relative 
to SSB and SR for almost all data generating scenarios and cohesions. 

Table provides results for data with no clusters. Notice that we do not report the 
Rand index in this scenario as the CPS and SSB by construction create clusters. Because 
of this, as expected, the one cluster partition is not recovered well. That said, this scenario 
allows us to assess over-fit properties as the data structure is much simpler. It turns out that 
the model fits associated with data that contain no clusters are similar to those produced 
with data contained four clusters. However, the MSPE values are slightly better (which was 
expected). Generally speaking, it appears that CPS continues to perform well relative to 
SSB for each of the cohesions and SR (it is a bit surprising that SR does not perform much 
better). 
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Table 3; Simulation study results when data are generated with four clusters. 






M = 1.0 



M = 0.1 



M = 0.01 


Error 

Cluster 

Method 

RAND 

LPML 

MSPE 

RAND 

LPML 

MSPE 

RAND 

LPML 

MSPE 



CPS Ci^_, 

0.05 

-169.73 

2.75 

0.09 

-172.61 

2.45 

0.16 

-178.07 

2.43 



CPS Ci^_, 

0.06 

-183.36 

2.47 

0.12 

-179.09 

2.40 

0.18 

-180.24 

2.26 



CPS C 2 

0.16 

-183.49 

2.34 

0.37 

-182.49 

2.24 

0.49 

-182.78 

2.32 


Square 

CPS C 3 

0.52 

-183.21 

2.37 

0.50 

-184.24 

2.28 

0.43 

-184.09 

2.41 



CPS C 4 

0.29 

-179.39 

2.29 

0.51 

-180.74 

2.18 

0.59 

-181.58 

2.27 



SSB 

0.15 

-189.46 

3.50 

0.16 

-190.22 

3.37 

0.13 

-189.45 

3.39 

Gaussian 


SR 

- 

-2669.12 

22.27 

- 

-2501.09 

21.93 

- 

-2804.15 

22.02 


CPS Ci^_, 

0.07 

-166.78 

2.55 

0.14 

-173.83 

2.39 

0.27 

-176.76 

2.28 



CPS 

0.09 

-176.04 

2.42 

0.17 

-175.51 

2.16 

0.28 

-177.87 

2.11 



CPS C 2 

0.25 

-183.70 

2.35 

0.46 

-183.52 

2.30 

0.52 

-183.28 

2.32 


Irregular 

CPS C 3 

0.64 

-181.00 

2.24 

0.58 

-183.06 

2.33 

0.57 

-182.55 

2.30 



CPS C 4 

0.63 

-176.68 

2.07 

0.73 

-178.89 

2.13 

0.74 

-178.99 

2.09 



SSB 

0.20 

-183.92 

2.91 

0.17 

-183.73 

2.86 

0.19 

-184.44 

2.87 



SR 

- 

-2460.53 

21.04 

- 

-2267.52 

21.62 

- 

-2632.71 

21.36 



CPS 

0.05 

-169.89 

2.62 

0.09 

-172.04 

2.43 

0.16 

-176.90 

2.36 



CPS 

0.06 

-179.92 

2.54 

0.11 

-179.26 

2.36 

0.19 

-178.36 

2.18 



CPS C'z'" 

0.16 

-183.50 

2.27 

0.36 

-181.42 

2.24 

0.47 

-182.74 

2.28 


Square 

CPS C 3 

0.52 

-183.27 

2.25 

0.47 

-183.02 

2.29 

0.43 

-184.64 

2.35 



CPS C 4 

0.29 

-179.05 

2.18 

0.50 

-179.88 

2.18 

0.57 

-181.92 

2.21 



SSB 

0.16 

-189.17 

3.36 

0.16 

-189.33 

3.40 

0.15 

-188.22 

3.35 

Mixture 


SR 

- 

-2320.54 

22.40 

- 

-2383.69 

22.17 

- 

-2400.44 

21.91 


CPS Ci^_, 

0.07 

-170.99 

2.61 

0.17 

-176.83 

2.46 

0.27 

-176.37 

2.27 



CPS 

0.10 

-179.31 

2.40 

0.18 

-176.41 

2.29 

0.29 

-176.05 

2.20 



CPS C 2 

0.22 

-185.50 

2.48 

0.46 

-184.50 

2.41 

0.54 

-182.95 

2.30 


Irregular 

CPS C 3 

0.60 

-183.57 

2.33 

0.56 

-184.98 

2.36 

0.58 

-182.77 

2.27 



CPS C 4 

0.61 

-178.93 

2.14 

0.72 

-180.52 

2.13 

0.77 

-178.58 

2.07 



SSB 

0.18 

-184.78 

3.01 

0.19 

-185.32 

2.98 

0.19 

-184.24 

2.96 



SR 

- 

-2445.62 

21.61 

- 

-2412.06 

21.67 

- 

-2420.39 

21.71 
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Table 4: Simulation study results when data are generated with one cluster. 





M = 

1.0 

M = 

0.1 

M = 

0.01 

Error 

Cluster 

Method 

LPML 

MSPE 

LPML 

MSPE 

LPML 

MSPE 



CPS Ci^_, 

-168.99 

2.06 

-172.97 

2.09 

-174.82 

2.08 



CPS 

-171.94 

2.01 

-173.14 

1.96 

-172.66 

1.92 



CPS C 2 

-176.97 

2.02 

-177.12 

2.07 

-177.98 

2.06 


Square 

CPS C 3 

-178.06 

2.07 

-178.77 

2.12 

-179.18 

2.15 



CPS C 4 

-175.33 

2.01 

-176.70 

2.05 

-178.15 

2.07 



SSB 

-175.18 

2.10 

-176.30 

2.13 

-176.49 

2.14 

Gaussian 


SR 

-2275.31 

19.99 

-2803.85 

19.59 

-2504.16 

20.11 


CPS Ci^_, 

-165.90 

1.98 

-170.31 

1.96 

-174.71 

1.95 



CPS Ci^_, 

-168.86 

1.88 

-169.64 

1.85 

-170.12 

1.76 



CPS C 2 

-175.76 

1.96 

-174.65 

1.95 

-176.82 

1.95 


Irregular 

CPS C 3 

-176.33 

1.98 

-175.54 

1.99 

-177.61 

2.01 



CPS C 4 

-173.47 

1.89 

-173.52 

1.94 

-176.12 

1.95 



SSB 

-175.12 

2.06 

-174.91 

2.07 

-175.11 

2.07 



SR 

-1913.70 

19.58 

-1902.62 

20.13 

-2115.85 

19.77 



CPS Ci^_, 

-172.31 

2.14 

-172.95 

2.08 

-176.83 

2.01 



CPS 

-179.92 

2.00 

-179.26 

2.04 

-178.36 

1.97 



CPS ^ 2 “ 

-178.38 

2.11 

-177.22 

2.04 

-178.46 

1.99 


Square 

CPS C3 

-179.00 

2.15 

-178.31 

2.12 

-179.95 

2.06 



CPS C4 

-177.00 

2.07 

-176.30 

2.02 

-178.80 

1.99 



SSB 

-177.51 

2.21 

-176.22 

2.17 

-177.21 

2.10 

Mixture 


SR 

-2470.62 

19.47 

-2776.41 

19.97 

-2532.80 

19.16 


CPS 

-168.59 

2.00 

-167.94 

1.90 

-172.75 

1.96 



CPS 

-168.61 

1.84 

-168.21 

1.84 

-170.23 

1.81 



CPS C2 

-175.51 

1.98 

-173.57 

1.90 

-175.64 

1.98 


Irregular 

CPS C3 

-176.13 

2.00 

-174.25 

1.94 

-176.14 

2.01 



CPS C4 

-173.75 

1.93 

-172.53 

1.88 

-175.09 

1.97 



SSB 

-175.18 

2.12 

-173.83 

2.02 

-175.20 

2.08 



SR 

-2040.83 

19.94 

-2291.49 

19.47 

-1847.70 

20.25 
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Latitude 


4.2 Application: Chilean Standardized Testing 


Over the past 25 years Chile’s Ministry of Education has established a national large-scale 
standardized test called SIMCE (Sistema de Medicion de la Calidad de la Educacion, System 
Measurement of Quality of Education). It was introduced during the later part of the 80’s 
and since then has continually grown in scope and scale and is now a key component of 


Chilean educational policies (Meckes and Carrasco 2010, Manzi and Preiss 2013). During 
the early part of the 80’s education was privatized in Chile affording parents a great deal of 
flexibility when deciding to which school to send their children. One of the purported roles 
of SIMCE is to aid parents in making this decision. In addition to administrating the exam 
other socio-economic variables are recorded. Among them is mother’s education level which 
is known to influence individual SIMCE scores. Therefore, we include mother’s education as 
a covariate in modeling. 

We briefly note that accommodating spatial dependence in education studies has only 


very recently been considered. In fact, the one article we found is Neelon et ah (2014). 
They explore regional differences in end of grade test scores in North Carolina using county 
level data. This was done by modeling reading and math scores jointly through a fairly 
sophisticated joint conditional autoregressive model. 



■70.8 -70.7 -70.6 -70.5 

Longitude 


SIMCE 



1.5 

• 1.5 



■70.8 -70,7 -70.6 -70.5 
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16 



1.5 

• 1.5 


Figure 6: Spatial plots of SIMCE math scores and mother education level. The left figure 
corresponds with average SIMCE math scores, while the right average mothers education 

level. 
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We were given access to individual 2011 SIMCE 4th grade math scores. To simplify 
the analysis, instead of analyzing individual test scores and mother’s education level, we 
compute school-wide averages for both variables. The longitude and latitude of each school 
was recorded and we focus only on those schools that are located in the greater Santiago 
area (which produced 1215 schools). Figure provides a spatial plot for both SIMCE and 
mother’s education values. Notice that schools in the north east part of the city tend to 
have higher SIMCE scores than those in the south and west. Mother’s education level also 
varies spatially with lower levels generally appearing in the west and south of Santiago. 
An exploratory analysis was performed to investigate spatial structures in the SIMCE data 
results of which are provided in the Supplementary Material. 

To demonstrate the flexibility of pairing the sPPM with a variety of likelihoods, in what 
follows we detail and compare three reasonable models that could be proposed for the SIMCE 
data. In each case, SIMCE scores and mother’s education are standardized to have mean 
zero and unit standard deviation and the proposed model was £t to data by collecting 1000 
MCMC iterates after discarding the hrst 10,000 as burn-in and thinning by 20. Convergence 
was monitored graphically. The MCMC chains mixed reasonably well and converged quickly. 

To assess out of sample prediction, we divided the 1215 schools into 600 training obser¬ 
vations and 615 testing observations. This partitioning of the data also facilitated a cross- 
validations study (see Supplementary Material) that in addition to information gleaned from 
the simulation study resulted in setting M equal to 5 x 10“®, 0.1, 1.0, and 0.5 for cohesions 
1-4 respectively. For Ci both a = 1 and a = 2 were considered, but only results from a = 1 
are reported as a = 2 produced very similar fits. The tuning parameters associated with 
other cohesions are those employed previously. 


4.2.1 Conditional Model 


In order to compare hts and predictions associated with sPPM to those of SSB, our hrst 
modeling approach is to model SIMCE scores conditional on mother education level with 


spatial structure in the prior only. This model corresponds to the CPS model of Section 4.1 


To compare model £t we once again employ LPML (see Christensen et al']|2011 ), but now 
also include MSE = ^ ^^e Watanabe-Akaike information criterion 


(WAIC) which is a fairly new hierarchical model selection metric advocated in Gelman et ah 
(2014). The MSPE associated with the 615 testing observations is also provided under 


the “MSPE” column of Table Excluding Cs, it appears that CPS hts the data better 
than SSB. Additionally, CPS appears to make more accurate predictions compared to SSB 
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with (^4 producing the most accurate. CPS with Ci clearly fits the data best and produces 
competitive predictions. 


Table 5: Model fit comparisons associated with SIMCE test score data for sPPM and SSB 


Procedure 

WAIC 

LPML 

MSE 

MSPE 

CPS Cl 

2113.64 

-1314.21 

0.12 

0.533 

CPS C 2 

2420.56 

-1358.97 

0.21 

0.535 

CPS C 3 

2739.73 

-1364.31 

0.48 

0.538 

CPS Ci 

2706.71 

-1361.58 

0.40 

0.516 

SSB 

2733.40 

-1387.91 

0.48 

0.536 


For the CPS procedure predicting an average SIMCE score for a completely new school 
requires knowing the new school’s location and mother’s education level. One approach would 
be to discretize mother’s education into, say, three levels and create a predictive map for 
each one. An alternative approach would be to first predict mother’s education level for the 
new school, then use the predicted mother’s education level as covariate to predict SIMCE. 
Using the later approach, the 600 training observations, and a regular grid of locations that 
belonged to the convex hull created by the observed school locations, we predict SIMCE 
scores by first predicting mother’s education level using a model similar to CPS but free of 
covariates, (i.e., z(si)|p, cr^ ~ (sj), a^) where z{si) denotes mother’s education level 

at the ith new school.) The predictive map of mother’s education values and SIMCE scores 
is provided in Figure]^ (we only report predictions from Ci as the others were similar). The 
predicted values of mother’s education level and SIMCE math scores are completely plausible 
and the resulting spatial structure follows the general social-economic spatial distribution 
that is known to exist in Santiago. 


4.2.2 Joint Model 


Making predictions with the previous model is somewhat awkward as mother’s education 
needs to be either fixed or predicted using a completely different model. A more natural 
and coherent modeling approach for this application would be to model SIMCE scores and 
mother’s eduction jointly as both could be thought of as random quantities. To demonstrate 
flexibility in which sPPM can be incorporated in modeling and because comparisons to the 
SSB are not available for the joint model, we include spatial structure in the likelihood which 


amounts to using a simple coregionalization model (Banerjee et ah {20 14 Chapter 9). Now let 
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Figure 7; Predictive maps for mother’s education and SIMCE scores. The predicted mother’s 
education levels were used to predict SIMCE 


y{si) = [?/i(sj), y 2 {si)]' denote the ith. school’s average SIMCE score and mother’s education 
level and consider the following data model 


y{si) = + e{si) + e{si), i = 


(4.2) 


where ^ cluster specihc 2 -dimensional intercept vector whose 

spatial structure is guided through a sPPM prior, 0(sj) = ( 6 *i(sj), 6 * 2 (si))' is a two-dimensional 
intercept whose spatial structure is directly incorporated into the likelihood in a manner that 
will be described shortly, and e(sj) ~ iV 2 ( 0 , S) is an error term. S contains dependence 
structure between SIMCE and mother’s education with variances denoted by af and and 
covariance ai 2 = ycricr 2 - For h = 1,...,we assume y-l{si) ~ N 2 {i-io,T). To address 
spatial structure for each variable and the dependence that may exist between these two 
spatial processes, instead of modeling 6 *(sj) and 6 * 2 (si) directly with a Gaussian process we 
instead introduce ( 6 *j(si), 0 j{s2 ),..., 9j{Sn)) ~ GP{0, Cj) independently for j = 1 , 2 and set 


\ 6^2(54) 




where A 


1 

7 


7 

1 
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for 7 G (0,1). Cj of the Gaussian process denotes a valid covariance matrix constructed 
using an exponential covariance function. Thus, the (£, £')th entry of (Cj) is = 

rj exp{—0j||s£—S£/||}. Prior distributions employed are ~ Gamma(l,l), (j)j ~ 17A^(0.5,30) 
(this implies a t/iV(0.1,6) for effective range), Ho ~ iV2(0,10^/), T ~ Jiy(2,/), and 
S ~ IW{2,I). We use IW{u,A) to denote an inverse Wishart distribution with scale 
and matrix parameters z/ and A. 

Under this model prediction of the SIMCE math score for a new school located at Sq is 
easily made via 2 /i(so)||/ 2 (so) which has the following form 

l/i(so)|2/2(so) ~ N (^oeo(so) + /3i*|/2(So), u?(1 - 7^)) , 

with 131 = and + 0 i(so) - AIhLo + ^ 2 (^ 0 )]• 

For this procedure to be useful, predictions of 6 ^ 1 ( 80 ), 6 ^ 2 ( 50 ), and 2 / 2 (so) are 

needed. Values for fil and ^2 are readily available once cq is classihed by way of the predic¬ 
tive distribution found Section 2 of the Supplementary Material (equation S.l). Values for 
[ 6 'i(so), 6 ' 2 (so)] are obtained by Erst predicting [0i(so), ^ 2 (^ 0 )] from 0i(so)|^i(si), • • ■ ,^i(sn) 
and 02 (so)I^ 2 (^ 1 ), • • • , 02 (sn) independently and then setting [ 0 i(so), 02 (so)]' = A[ 0 i(so), 02 (so)]'- 
Finally, using the fact that |/ 2 (so) ~ -^(hLo + ^ 2 (^ 0 ), cr|) a prediction for |/ 2 (so) is easily ob¬ 
tained. We will refer to the procedure just described as the Joint model with Likelihood 
Spatial Structure (JLS) model. 

JLS can become computationally expensive as the number of schools grows. Incorpo¬ 
rating spatial information solely in the prior would radically reduce computation time, but 
potentially at the cost of model ht. To investigate this trade off, we also consider 

y{s,)\fi\ c, iV 2 (^;(s,), S) for * = 1,..., n and S ~ /W(2, 1) 

~ iV 2 (AXo,T) with T ~ IW{2,I) 
y,Q ~ ^ 2 ( 0 , 10 ^/) 

{QjjLi ~ sPPM. 


As in the JLS, predictions at location Sq are also easily made via F[|/i(so)|?/ 2 (so)] = hico(®o) + 
h^[l/ 2 (so) - y* 2 coi^o)]- Values for /ijco(so), hLo(®o), and 1 / 2 (sq) are gathered using the pro¬ 
cedure described for JLS. We will refer to this model as the Joint model with Prior Spatial 
Structure (JPS). 

Using the same M values as in Section[4.2.1|we fit JLS and JPS to the training data and 


31 



Table 6: Model fit comparisons for the JPS and JLS models fit to the SIMCE education 
data set. 


Procedure 

WAIC 

LPML 

MSE 

MSPE 

Clusters 

Time 

JPS Cl 

2312.503 

-1383.301 

0.380 

0.586 

35.767 

2154 

JPS C 2 

2569.589 

-1438.750 

0.415 

0.590 

34.746 

4621 

JPS Ca 

2778.803 

-1447.872 

0.482 

0.591 

8.921 

598 

JPS Q 

2552.333 

-1399.899 

0.433 

0.600 

26.750 

1090 

JLS Cl 

2047.319 

-1291.011 

0.244 

0.574 

34.992 

38017 

JLS C 2 

2266.945 

-1342.172 

0.258 

0.569 

34.249 

41022 

JLS Cs 

2553.984 

-1376.176 

0.365 

0.573 

6.789 

38538 

JLS Q 

2273.479 

-1331.949 

0.334 

0.606 

26.952 

37565 


carried out prediction using the same grid of points and the testing data. Comparisons of 
the two joint models regarding model fit and computation time are provided in Table The 
column “Clusters” is the expected number of clusters a posteriori and “Time” is the amount 
of computing time required to fit models (measured in seconds). MSPE is associated with 
the 600 testing observations. As expected fits using JLS are much better for all cohesion 
functions but at a substantial computational cost. However, JPS out of sample predictions 
are fairly competitive to those from JLS and may be considered if a timely answer is needed. 

Maps associated with predictions made using JPS and JLS are provided in Figures 
and 1^ For JPS the four cohesions produce fairly different predictive surfaces, while for 
JLS the surfaces are very similar among the four cohesions. This illustrates that including 
spatial structure in the likelihood greatly impacts the predictive maps. For both procedures, 
the predictive maps identify the same general areas that contain higher SIMCE scores, but 
changes in SIMCE scores as a function of space are far more pronounced for JLS. This may 
be indicating that predictions are more local for JLS relative to JPS. 

5 Conclusions 

We have proposed a general procedure that extends PPMs to a spatial setting providing a 
mechanism to directly model the partitioning of locations into spatially dependent clusters. 
This mechanism in turn provides a means to introducing sophisticated spatial structures 
in modeling in a straightforward fashion. The cohesion function of the sPPM affords a 
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Figure 8: Predictive maps associated with JPS for each of the four cohesion functions 
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Figure 9: Predictive maps associated with JLS for each of the four cohesion functions 
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great deal of flexibility regarding the type of spatial clusters available and the four that 
we have proposed are certainly not exhaustive. Other functions can be developed that 
produce different types of spatial structures. The simulation study and application showed 
that the methodology is particularly well suited for predictions and the fact that spatial 
information can be incorporated in the prior and likelihood allows for added flexibility in 
how spatial structure is modeled, providing the added benefit of capturing local structure. 
Exactly how to join local spatial structure so that global maps are smooth and continuous 
(if so desired) is a topic of ongoing research. Although not explicitly considered, including 
covariate information in the clustering mechanism in addition to spatial information should 


be a natural extension of work developed in Muller et ah (2011). 
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Appendices 

A Marginal Correlation Proof 


We provide a detailed proof of Proposition 
very similar arguments. 


3.2 and 3.3 


The proof of Proposition |3.1 follows 
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A.l Proof of Proposition 2 

Proof. From the law of total covariance 


= co'Up.y.elE(yi\p, 13,9), E(yj\p, 0,9)] + Ei,. 0 y{cau(y,, yj,\p, 0,9)] 

= E,. 0 .ol(x', 0 l + e,){x] 0 f + ».)| - E,Mx‘, 0 l + e,\E,. 0 .f,[x] 0 f + 0,1 + 0 
= E,. 0 .,[]x] 0 i:)[x'fii.) + )x] 0 i:)e, + o,(x'/3;.) + oa] - E,. 0 [x] 0 i\E,Ax] 0 ifi 
= E,0[(x]0i:)]xr0l_)] + Eolm] - E,0[x'fif\E,0[x]0l^ 

^Y.E 0 [tT{ 0 lx,x'^ 0 l^}\Pr{f) - (j 2 ^l^^<K.]Pr{p)] (j 2 ^l>^^'oK,]Pr(p) 

P \ P / \ p / 

= '^Ei 3 [tr{xix'jP*^P*Jj]Pr{p) - a;'^Pr(p)j (^x'jfiPr{p)^ +cov{9i,9j) 

= ^ tr{xix'j{T + fifj,')}Pr{p) + ^ tr{xix'j{fifi')}Pr{p) - fi'xfx'jfi + cov{9i,9j) 


cov{9i, 9j) 


p:Ci=Cj 




Pr{p) + cov{9i, 9j) 


p:Ci=Cj 


= x'jTxiPr{ci = Cj) + 


Now using the law of total variance 


var{yi) = Ep^i3^e[var{yi\p, P,0)] + varp^i3^0[E{yi\p, P,0)] 
= + var p^fi^e[x'iPl. + 9i] 

= (T^ + + x'/Txi. 


Using corr{yi,yj) 


cov{yi,yj) 

Vvar{yi)^yvar{yj) 


completes the proof. 


□ 
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A.2 Proof of Proposition 3 

Proof. Following similar arguments from the previous proof, 

covivi, Vj) = coVp,f3,e[E{yi\p, f3, 6), E{yj\p, f3,6)] + Ep^f3^e[cov{yi, yj, \p, f3, 6)] 
= x[Txj+ ^ cov{9i,9j)Pr{p) + ^ cov{9i,9j)Pr{p) 

p‘Ci Cj p'.Ci^Cj 

= x[Txj+ ^ cov{9i,9j)Pr{p) 

p-C{ —Cj 
hfi 

^<Tx,+Y. Z >l{H(4>u))i.iP-r(p) 

h=l p:ci=Cj=h 
kn 

^<Tx., + Y.\1{h(4,, Y. 

h=l p:Ci=Cj=h 

k-n 

= x-Txj + \l{H{<i)h))i,jPr{ci = Cj = h) 
h=l 

And now using the law of total variance 


var{yi) = Ep^i3^e[var{yi\p, P.O)] + / 3 , 0 )] 

= + x\Txi + Yj vare{9i)Pr{p) 

p 

kji 

= + x[Txi + var{9i) "Y^ Pr{p) 

h=l p:Ci=h 

kji 

= + x[Txi + Y = h) 

h=l 


Using corr{yi,yj) 


cov{yi,yj) 

\/var{yi)^yvar{yj) 


completes the proof. 


□ 
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